We present a new method to extract parton distribution functions from high
energy experimental data based on a specific type of neural network, the
Self-Organizing Map. We illustrate the features of our new procedure that
are particularly useful for an analysis directed at extracting generalized
parton distributions from data. We show quantitative results of our initial
analysis of the parton distribution functions from inclusive deep inelastic
scattering.

1. Introduction



High energy processes can be described within Quantum Chromodynamics (QCD)
through its factorization properties. These allow us to write the cross
sections by separating the hard part, which is the well defined interaction
of a quark with e.g. a virtual photon, from the soft part, described in
terms of Parton Distribution Functions (PDFs). The latter cannot be
calculated from first principles: information about the PDFs can only be
obtained from experiment. Over the past decades several collaborations have
used fitting techniques to study the behavior of the PDFs by devising
models/analytic forms for the various PDFs, $f_i(x, Q^2)$,
$i = u, d, s, c, b, t, g$, at light cone momentum fraction $x$ and scale
$Q^2$, whose parameters are constrained by experiment.

More recently, new types of partonic distributions that considerably improve
our ability to study the structure of the nucleon and the dynamics of its
partons became accessible experimentally. The new distributions extend, in
different ways, the concept of PDFs to observables that can measure the spin,
spatial and momentum correlations in hadrons. These are the Transverse
Momentum Distributions, TMDs (Ref.4), and the Generalized Parton
Distributions, GPDs (see the review in Ref.5). For GPDs and TMDs the range
of kinematical variables that need to be measured increases, since one
introduces both transverse momenta and the Fourier conjugates of the
partonic spatial degrees of freedom.

For the PDFs one faces the problem that, lacking a well defined theory to
constrain them, the results/extractions vary significantly from group to
group (for a review of recent results see e.g. Refs.1,2). For TMDs, and more
crucially for GPDs, standard global analyses are of doubtful applicability
because the number of variables has increased while the experimental data
remain scarce (see e.g. Ref.3 for a state of the art analysis).

A relatively new approach to PDF fitting is the one proposed by the NNPDF
collaboration (Ref.6), who have replaced the standard analytic forms of PDFs
with a more complex Neural Network (NN) solution. The estimated uncertainties
for NNPDF fits are larger than those of global fits, possibly indicating that
the global fit uncertainties are underestimated. In Ref.7 a criticism was
put forward about relying on purely automated fitting procedures such as the
ones used by NNPDF, and a specific type of neural network, the
Self-Organizing Map (SOM), was proposed instead. The main point is that,
since for NNPDF the effect of modifying individual NN parameters is unknown,
the result might not be under control in the extrapolation region, or in
between the data points if the data are sparse. This issue is even more
important when extending the fitting procedures to a wider set of
semi-inclusive and exclusive observables. We therefore pushed forward with
the SOM method, and we improved the preliminary work in Ref.7 by
restructuring the original code in such a way that, on one side, a fully
quantitative error analysis can be implemented and, on the other, we now
have sufficient flexibility to allow for analyses of different observables,
including the matrix elements for deeply virtual exclusive and
semi-inclusive processes. Our first quantitative results for the unpolarized
case using Next-to-Leading-Order (NLO) perturbative QCD in the
$\overline{\rm MS}$ scheme were presented in Ref.8. In this workshop we also
discuss a possible procedure to extend SOMs to the extraction of GPDs.

2. Self-Organizing Maps

Recent years have seen an outstanding growth in the usage of neural networks
as paradigms for computational methods. SOMs are a type of neural network
developed by T. Kohonen in the '80s (Ref.9), based on a topological mapping
of the external environment onto the brain's internal neural connections.

In SOMs the nodes/neurons (map cells) are tuned to a set of input
signals/data/samples according to a form of adaptation. The various nodes
form a topologically ordered map during the learning process. This feature
constitutes perhaps the main strength of SOMs, in that it allows one to
obtain a simplified two-dimensional representation of complex data that
would otherwise depend on a "hard to control" set of parameters (see Fig. 1).
In this respect SOMs can be considered a nonlinear form of principal
component analysis, which is often used to analyze high-dimensional data.

Another aspect that sets SOMs apart from other NNs is their learning
process. SOMs learn via unsupervised learning, whereas the learning of
generic artificial NNs is supervised. In supervised learning a set of
examples is given, and the goal is to force the output to match the examples
as closely as possible. A cost function is defined that measures the cost of
missing or detecting the correct result, and it is minimized during the
learning process. In unsupervised learning the cost function is minimized
without introducing a definite set of examples, just by similarity
relations, or by finding how the data cluster or self-organize. Theoretical
studies of the mechanism for map evolution are in progress. In phenomenology
many new uses, one of which is described below, might derive from this
property.

SOMs are built as two-dimensional arrays whose cells get sensitized/tuned to
a specific set of input signals according to a given order. To illustrate
this, in Fig. 1 we show the simple example of a "color map", where the input
signals are represented by the three colors to be associated with each one
of the map cells.

The SOM algorithm consists of three stages: i) Initialization; ii) Training;
iii) Mapping.

Fig. 1. A simple illustration of a 20 × 20 square map, and its input signals
represented by colors. The arrows indicate the signal to cell matching
procedure. In our more realistic situation, the map and signals will consist
of either PDFs or DIS structure functions.

During the initialization procedure weight vectors of dimension $n$ are
associated to each cell $i$:

$$V_i = [v_i^{(1)}, \ldots, v_i^{(n)}].$$

The $V_i$ are given spatial coordinates, i.e. one defines the
geometry/topology of a 2D map that gets populated randomly with the $V_i$.
For the training, a set of input data

$$\xi = [\xi^{(1)}, \ldots, \xi^{(n)}]$$

(isomorphic to $V_i$) is then presented to $V_i$, or compared via a
"similarity metric" that we choose to be

$$L_p = \left( \sum_{j=1}^{n} |v_i^{(j)} - \xi^{(j)}|^p \right)^{1/p}, \qquad p = 1, 2.$$

The most similar weight vector is the Best Matching Unit (BMU).
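The BMU search can be illustrated with a short sketch. It assumes that the similarity metric is an L1 or L2 distance between a cell's weight vector and the input vector; the function names and the toy 2 × 2 color map are hypothetical and are not taken from the SOMPDF code.

```python
def l1_distance(v, xi):
    """Sum of absolute component differences (L1 metric)."""
    return sum(abs(a - b) for a, b in zip(v, xi))


def l2_distance(v, xi):
    """Euclidean (L2) metric."""
    return sum((a - b) ** 2 for a, b in zip(v, xi)) ** 0.5


def best_matching_unit(weights, xi, metric=l1_distance):
    """Return the (row, col) of the cell whose weight vector is closest to xi.

    `weights` is a 2D list of weight vectors, one per map cell.
    """
    best_cell, best_dist = None, float("inf")
    for r, row in enumerate(weights):
        for c, v in enumerate(row):
            d = metric(v, xi)
            if d < best_dist:
                best_cell, best_dist = (r, c), d
    return best_cell


# Toy 2 x 2 "color map": each cell holds an (R, G, B) weight vector.
toy_map = [
    [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
    [(0.0, 0.0, 1.0), (0.5, 0.5, 0.5)],
]
# A mostly green input signal is matched to the green cell at (0, 1).
bmu = best_matching_unit(toy_map, (0.1, 0.9, 0.2))
```

In a PDF application the weight vectors would hold PDF values on a grid of kinematical points rather than colors, but the matching step would have the same structure.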

SOMs are based on unsupervised and "competitive" learning. This means that
the cells that are closest to the BMU activate each other in order to
"learn" from the input vector $\xi$. Practically, they adjust their values
according to

$$V_i(n+1) = V_i(n) + h_{ci}(n)\,[\xi(n) - V_i(n)], \qquad (1)$$

where $n$ is the iteration number, and $h_{ci}(n)$ is the "neighborhood
function" defining a radius on the map which decreases with both $n$ and the
distance between the BMU and node $i$. In our case we use square maps of
size $L_{\rm map}$, and

$$h_{ci}(n) = 1.5 \left( \frac{n_{\rm train} - n}{n_{\rm train}} \right) L_{\rm map}, \qquad (2)$$

where $n_{\rm train}$ is the number of iterations. Once a SOM is properly
trained, cells that are topologically close to each other will contain data
which are similar to each other. In the final phase the actual data are
distributed on the map and clusters emerge. (Note that the specific location
of the clusters on the map is not relevant and will not necessarily be the
same from one run to another; only the clustering properties are important.)
Since each map vector now represents a class of similar objects, the SOM is
an ideal tool to visualize high-dimensional data, by projecting it onto a
low-dimensional map clustered according to some desired similarity feature.
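The initialization and training stages can be sketched as follows for the color-map example. This is a minimal illustration and not the SOMPDF implementation: it assumes an L1 similarity metric, uses the linearly shrinking radius $1.5\,(n_{\rm train} - n)/n_{\rm train}\,L_{\rm map}$ to decide which cells are updated, and introduces a hypothetical learning strength that falls off linearly with the distance from the BMU on the map.

```python
import random


def neighborhood_radius(n, n_train, l_map):
    """Radius on the map that shrinks linearly with the iteration number n."""
    return 1.5 * (n_train - n) / n_train * l_map


def train_som(samples, l_map=5, n_train=50, seed=0):
    """Train a square l_map x l_map map on `samples` (lists of equal length)."""
    rng = random.Random(seed)
    dim = len(samples[0])
    # Initialization: populate the map with random weight vectors.
    weights = [[[rng.random() for _ in range(dim)] for _ in range(l_map)]
               for _ in range(l_map)]
    for n in range(n_train):
        radius = neighborhood_radius(n, n_train, l_map)
        xi = rng.choice(samples)
        # Best Matching Unit: smallest L1 distance to the input vector.
        bmu = min(((r, c) for r in range(l_map) for c in range(l_map)),
                  key=lambda rc: sum(abs(w - x) for w, x in
                                     zip(weights[rc[0]][rc[1]], xi)))
        for r in range(l_map):
            for c in range(l_map):
                # Distance from the BMU measured in map cells.
                dist = max(abs(r - bmu[0]), abs(c - bmu[1]))
                if dist <= radius:
                    # ASSUMPTION: the learning strength decreases linearly
                    # with map distance; the text does not fix this form.
                    h = 1.0 - dist / (radius + 1.0)
                    weights[r][c] = [w + h * (x - w)
                                     for w, x in zip(weights[r][c], xi)]
    return weights


# Three input "colors" (red, green, blue) presented to a 5 x 5 map.
colors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
color_map = train_som(colors)
```

Early in the training the radius covers the whole map, so every cell is pulled towards the presented signal; towards the end only the BMU itself is adjusted, which is what lets similar cells settle next to each other.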

3. SOMPDF Parametrization 

In Ref.7 a SOM algorithm was constructed which, together with a Genetic
Algorithm (GA), was applied to PDF fitting. This initial work showed that
the SOM method works well as a minimization technique, in that the
$\chi^2$/d.o.f. decreased towards unity with each GA iteration. However,
problems connected with the smoothing of the functions remained, related to
the stochastic procedure used to generate them and to initialize and train
the map. These prevented a fully quantitative analysis of the data,
including uncertainty evaluations.

A solution to this problem was obtained in the present work (Refs.8,10) by
developing a new version of the SOMPDF code in which several important
improvements were introduced. By writing it in a single compiled language,
Fortran 95, the code was made more flexible and faster, since a parallel
(MPI) version was easily implemented. As explained in more detail below, the
increased flexibility and speed now allow us to perform random variations on
the parameters of the various input PDFs, instead of on the PDF values
themselves, thus providing continuous solutions which are amenable to
standard error analyses.
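The difference between varying parameters and varying PDF values can be made concrete with a toy comparison. The functional form $x^a (1-x)^b$, the parameter values, the 10% spread and the roughness measure below are all assumptions made for illustration; they are not the forms used in the SOMPDF code.

```python
import random


def toy_pdf(x, a, b):
    """Hypothetical functional form x^a (1 - x)^b, for illustration only."""
    return x ** a * (1.0 - x) ** b


def vary_parameters(a, b, spread, rng):
    """Return a new smooth curve obtained by jittering the parameters."""
    a_new = a * (1.0 + rng.uniform(-spread, spread))
    b_new = b * (1.0 + rng.uniform(-spread, spread))
    return lambda x: toy_pdf(x, a_new, b_new)


def vary_values(grid_values, spread, rng):
    """Jitter each tabulated value independently; the result is not smooth."""
    return [v * (1.0 + rng.uniform(-spread, spread)) for v in grid_values]


def roughness(values):
    """Sum of squared second differences, a crude measure of non-smoothness."""
    return sum((values[i + 1] - 2.0 * values[i] + values[i - 1]) ** 2
               for i in range(1, len(values) - 1))


rng = random.Random(0)
grid = [i / 50.0 for i in range(1, 50)]
base = [toy_pdf(x, 0.5, 3.0) for x in grid]

curve = vary_parameters(0.5, 3.0, 0.1, rng)   # one draw of (a, b)
smooth = [curve(x) for x in grid]             # continuous variation
noisy = vary_values(base, 0.1, rng)           # point-by-point variation
```

Comparing `roughness(smooth)` with `roughness(noisy)` shows why the parameter-based variation is easier to treat with standard error analyses: the varied curve stays within the same family of smooth functions.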

Our ultimate goal is to provide a procedure to extract the GPDs and their
related observables, including e.g. the spatial partonic d.o.f., from
experimental data. We therefore added flexibility in the code's inputs so as
to accommodate the additional kinematical variables and parameters.

We now describe the main fitting procedure, originally applied to PDFs. 
This first phase of our work involves only marginally the clustering and 
visualization properties of SOMs. For PDFs this is, in fact, barely needed 
since the inclusive scattering databases provide a rather large kinematical 
coverage, and only two kinematical variables are necessary to describe the 
data. SOMs are used mostly to realize random variations of the PDF curves. 

Our fitting procedure is based on the initialization, training, and mapping
steps described for a general case in Section 2. In a nutshell, a set of
database/input PDFs is formed by selecting at random from a range of
existing PDF functions and varying their parameters. Baryon number and
momentum sum rules are imposed at every step. These input PDFs, evolved at
NLO to the desired $Q^2$, are used to initialize the map. A subset of input
PDFs is then used to train the map. Notice that this phase of the procedure
implies defining specific criteria for selecting the PDFs that will define
each subset. Two possible criteria are discussed in Ref.7.
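As an illustration of how the sum rules constrain a varied parameter set, the sketch below fixes the normalization of a toy valence distribution $N x^{a-1}(1-x)^b$ from the baryon number (valence counting) sum rule, and then fixes the gluon momentum fraction from the momentum sum rule. The functional form, the shape parameters and the assumed sea momentum fraction are hypothetical; only the proton valence content (two $u$ quarks, one $d$ quark) is taken as given.

```python
import math


def beta(p, q):
    """Euler Beta function B(p, q) written in terms of Gamma functions."""
    return math.gamma(p) * math.gamma(q) / math.gamma(p + q)


def valence_norm(a, b, n_valence):
    """N such that the integral of N x^(a-1) (1-x)^b over [0, 1] is n_valence."""
    return n_valence / beta(a, b + 1.0)


def momentum_fraction(norm, a, b):
    """Integral of x * N x^(a-1) (1-x)^b over [0, 1]."""
    return norm * beta(a + 1.0, b + 1.0)


# Hypothetical shape parameters, chosen only for illustration.
a_u, b_u = 0.5, 3.0
a_d, b_d = 0.5, 4.0

n_u = valence_norm(a_u, b_u, 2.0)  # two valence u quarks in the proton
n_d = valence_norm(a_d, b_d, 1.0)  # one valence d quark in the proton

mom_valence = (momentum_fraction(n_u, a_u, b_u)
               + momentum_fraction(n_d, a_d, b_d))
sea_fraction = 0.15  # ASSUMPTION: momentum carried by the sea quarks
# Momentum sum rule: all parton momentum fractions add up to one.
gluon_fraction = 1.0 - mom_valence - sea_fraction
```

Re-deriving the normalizations after every random variation of the shape parameters is one simple way to keep each candidate consistent with the sum rules.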

The similarity is tested by comparing the PDFs, according to Eqs.(1) and (2),
at given $x$ and $Q^2$ values. The new map PDFs are obtained by averaging
the neighboring PDFs with the BMU PDFs. Once the map is trained a GA is
implemented. The $\chi^2$ per input PDF is calculated with respect to the
experimental data. We then take a subset of these functions with the best
$\chi^2$, and use them as seeds to form a new set of input PDFs. We train
the map with the new set of input PDFs and repeat the process. The $\chi^2$
was found to decrease monotonically towards $\chi^2/\mathrm{d.o.f.} = 1$
with every GA iteration (Fig. 2). Our stopping criterion is met when the
$\chi^2$ stops varying, i.e. when its curve flattens.
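The selection loop described above can be sketched on a toy problem. The example below omits the map training step and only shows the GA part: candidates are ranked by $\chi^2$/d.o.f., the best ones are kept as seeds, new candidates are generated by varying the seeds' parameters, and the loop stops once the $\chi^2$ curve flattens. The toy model, the pseudo-data, the mutation width and the tolerance are all hypothetical choices.

```python
import random


def model(x, params):
    """Toy curve x^a (1 - x)^b standing in for a parametrized PDF."""
    a, b = params
    return x ** a * (1.0 - x) ** b


def chi2_per_dof(params, data):
    """chi^2 / d.o.f. of the toy curve against (x, value, error) points."""
    chi2 = sum(((model(x, params) - y) / err) ** 2 for x, y, err in data)
    return chi2 / (len(data) - len(params))


def ga_fit(data, n_candidates=40, n_seeds=8, tol=1e-3, max_iter=200, seed=0):
    rng = random.Random(seed)
    population = [(rng.uniform(0.1, 2.0), rng.uniform(1.0, 6.0))
                  for _ in range(n_candidates)]
    history = []
    for _ in range(max_iter):
        # Rank candidates by chi^2 and keep the best ones as seeds.
        population.sort(key=lambda p: chi2_per_dof(p, data))
        seeds = population[:n_seeds]
        history.append(chi2_per_dof(seeds[0], data))
        # Stopping criterion: the chi^2 curve has flattened.
        if len(history) > 5 and history[-6] - history[-1] < tol:
            break
        # New input set: the seeds plus randomly varied copies of them.
        population = list(seeds)
        while len(population) < n_candidates:
            a, b = rng.choice(seeds)
            population.append((abs(a + rng.gauss(0.0, 0.05)),
                               abs(b + rng.gauss(0.0, 0.05))))
    return seeds[0], history


truth = (0.5, 3.0)  # hypothetical "true" parameters behind the pseudo-data
data = [(x / 20.0, model(x / 20.0, truth), 0.01) for x in range(1, 19)]
best, history = ga_fit(data)
```

Because the seeds are carried over unchanged into the next generation, the best $\chi^2$ in `history` can never increase from one iteration to the next, which mirrors the monotonic decrease described in the text.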



Fig. 2. Illustration of the behavior of the $\chi^2$/d.o.f. for one of our
SOMPDF runs, as a function of the number of iterations.

Our first runs, presented in Fig. 3, used a "test" set of DIS data
consistent with the sets used in Refs.6,11,12. The data sets chosen were
from BCDMS, H1, NMC, SLAC and ZEUS (see e.g. references in Ref.6). The
shaded areas in the figure represent our error bands. In our preliminary
runs we defined a statistical error on an ensemble of SOMPDF runs.
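One simple way to build such a band is sketched below, under the assumption that the statistical error is taken as the standard deviation of the ensemble around its mean at each $x$ point; the text does not spell out the exact definition, and the numbers are made up for illustration.

```python
import statistics


def error_band(runs):
    """Mean and standard deviation at each x point over an ensemble of runs.

    `runs` is a list of curves, each a list of PDF values on a common x grid.
    """
    central, sigma = [], []
    for values_at_x in zip(*runs):
        central.append(statistics.mean(values_at_x))
        sigma.append(statistics.stdev(values_at_x))
    return central, sigma


# Three hypothetical runs tabulated on the same three x points.
runs = [
    [0.50, 0.40, 0.10],
    [0.54, 0.38, 0.12],
    [0.46, 0.42, 0.08],
]
central, sigma = error_band(runs)
# The shaded band at each x point would span central - sigma to central + sigma.
```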

Fig. 3. Upper panel: Test results using a 5 × 5 map for a set of 43 runs.
The panels represent (clockwise from upper left) the $u_v$, $s$, $d_v$, $u$
distributions, respectively, at $Q^2 = 7.5$ GeV$^2$. The shaded areas are
our results including the error analysis outlined in the text. For
comparison we also show results from several other parametrizations. Lower
panel: $u_v$, $u$ and gluon distributions at $Q^2 = 1.3$ GeV$^2$, along with
their variations obtained for $\chi^2$/d.o.f. < 1.2, from our previous
SOMPDF analysis. Left: 5 × 5 map, right: 15 × 15 map. These figures are
shown to illustrate the improvement in the smoothing of the curves attained
in our new procedure (adapted from Ref.7).

Finally, with a fully working procedure in hand, we comment on our
capability to extend the PDF analysis to the GPD case. Here one has a total
of 8 observables per parton component, given by the real and imaginary parts
of the Compton Form Factors (CFFs) containing the corresponding GPDs
(Ref.5). The number of observables that has been defined e.g. in Ref.3 is
17, of which 6 appear at Leading Order. An open question is therefore to
establish which experiments and observables can determine the various GPD
components, and with what precision. Our analysis will provide a step in
this direction in that, by applying the clustering properties of the SOMs,
we will be able to quantitatively discern the various CFFs, which will
become "dimensions" in our analysis, and to establish the sensitivity of
the different experiments to each one of them. An illustration of how
clustering of GPDs on the SOM might be represented is given in Fig. 4. More
work is in progress (Ref.10).

4. Conclusions 

In this work we described a new computational method based on
Self-Organizing Maps for parametrizing nucleon PDFs. Future developments
will also be directed at exploiting the full potential of SOMs, which offer
the capability of going beyond a fully automated procedure by enabling one
to control the fitting procedure at each step. The selection of the best PDF
candidates for the subsequent iteration could then be made based on the
user's preferences instead of solely on the $\chi^2$. Our program can be
extended to multivariable cases such as the Generalized Parton
Distributions, where the data are too sparse for stochastically generated,
parameter-free PDFs.

Fig. 4. Extension of our analysis to GPDs.

We thank the University of Virginia Alliance for Computational Science 
and Engineering for computer time, and the HPC group at Jefferson Lab, 
in particular David Richards and Chip Watson, for allotting us space on 
their clusters. This work is partially supported by the U.S. Department of 
Energy grant DE-FG02-01ER4120 (S.L. and D.Z.P.).